Search Results

Documents authored by Shen, Chengze


Document
Abstract
EMMA: Adding Sequences into a Constraint Alignment with High Accuracy and Scalability (Abstract)

Authors: Chengze Shen, Baqiao Liu, Kelly P. Williams, and Tandy Warnow

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)


Abstract
Multiple sequence alignment (MSA) is a crucial precursor to many downstream biological analyses, such as phylogeny estimation [Morrison, 2006], RNA structure prediction [Shapiro et al., 2007], protein structure prediction [Jumper et al., 2021], etc. Obtaining an accurate MSA can be challenging, especially when the dataset is large (i.e., more than 1000 sequences). A key technique for large-scale MSA estimation is to add sequences into an existing alignment. For example, biological knowledge can be used to form a reference alignment on a subset of the sequences, and then the remaining sequences can be added to the reference alignment. Another case where adding sequences into an existing alignment occurs is when new sequences or genomes are added to databases, leading to the opportunity to add the new sequences for each gene in the genome into a growing alignment. A third case is for de novo multiple sequence alignment, where a subset of the sequences is selected and aligned, and then the remaining sequences are added into this "backbone alignment" [Nguyen et al., 2015; Park et al., 2023; Shen et al., 2022; Liu and Warnow, 2023; Park and Warnow, 2023; Yamada et al., 2016]. Thus, adding sequences into existing alignments is a natural problem with multiple applications to biological sequence analysis. A few methods have been developed to add sequences into an existing alignment, with MAFFT--add [Katoh and Frith, 2012] perhaps the most well-known. However, several multiple sequence alignment methods that operate in two steps (first extract and align the backbone sequences and then add the remaining sequences into this backbone alignment) also provide utilities for adding sequences into a user-provided alignment. We present EMMA, a new approach for adding "query" sequences into an existing "constraint" alignment. By construction, EMMA never changes the constraint alignment, except through the introduction of additional sites to represent homologies between the query sequences. EMMA uses a divide-and-conquer technique combined with MAFFT--add (using the most accurate setting, MAFFT-linsi--add) to add sequences into a user-provided alignment. We evaluate EMMA by comparing it to MAFFT-linsi--add, MAFFT--add (the default setting), and WITCH-ng-add. We include a range of biological and simulated datasets (nucleotides and proteins) ranging in size from 1000 to almost 200,000 sequences and evaluate alignment accuracy and scalability. MAFFT-linsi--add was the slowest and least scalable method, only able to run on datasets with at most 1000 sequences in this study, but had excellent accuracy (often the best) on those datasets. We also see that EMMA has better recall than WITCH-ng-add and MAFFT--add on large datasets, especially when the backbone alignment is small or clade-based.

Cite as

Chengze Shen, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. EMMA: Adding Sequences into a Constraint Alignment with High Accuracy and Scalability (Abstract). In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 2:1-2:2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


Copy BibTex To Clipboard

@InProceedings{shen_et_al:LIPIcs.WABI.2023.2,
  author =	{Shen, Chengze and Liu, Baqiao and Williams, Kelly P. and Warnow, Tandy},
  title =	{{EMMA: Adding Sequences into a Constraint Alignment with High Accuracy and Scalability}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{2:1--2:2},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.2},
  URN =		{urn:nbn:de:0030-drops-186285},
  doi =		{10.4230/LIPIcs.WABI.2023.2},
  annote =	{Keywords: Multiple sequence alignment, constraint alignment, MAFFT}
}
Document
Abstract
BATCH-SCAMPP: Scaling Phylogenetic Placement Methods to Place Many Sequences (Abstract)

Authors: Eleanor Wedell, Chengze Shen, and Tandy Warnow

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)


Abstract
Phylogenetic placement is the problem of placing one or more query sequences into a phylogenetic "backbone" tree, which may be a maximum likelihood tree on a multiple sequence alignment for a single gene, a taxonomy with leaves labeled by sequences for a single gene [Nidhi Shah et al., 2021], or a species tree [Jiang et al., 2023]. When the backbone tree is a tree estimated on a single gene, the most accurate techniques for phylogenetic placement are likelihood-based, and can be computationally intensive when the backbone trees are large [Chu and Warnow, 2023]. Phylogenetic placement into gene trees occurs when updating existing gene trees with newly observed sequences, but can also be applied in the "bulk" context, where many sequences are placed at the same time into the backbone tree. For example, phylogenetic placement can be used to taxonomically characterize shotgun sequencing reads generated for an environmental sample in metagenomic analysis [Nidhi Shah et al., 2021; Barbera et al., 2019]. The two most well known maximum likelihood phylogenetic placement methods are pplacer [Chu and Warnow, 2023] and EPA-ng [Barbera et al., 2019]. Of these two, EPA-ng is optimized for scaling the number of query sequences and is capable of placing millions of sequences into phylogenetic trees of up to a few thousand sequences [Barbera et al., 2019], and achieves sublinear runtime in the number of query sequences (see Figure 2 from [Balaban et al., 2022]). Previously we introduced the SCAMPP framework [Wedell et al., 2022] to enable both pplacer and EPA-ng to perform phylogenetic placement into ultra-large backbone trees, and we demonstrated its utility for placing into backbone trees with up to 200,000 sequences. By using maximum likelihood methods pplacer or EPA-ng within the SCAMPP framework, the resulting placements are more accurate than with APPLES-2 [Balaban et al., 2022], with the most notable accuracy improvement for fragmentary sequences, and are computationally similar for single query sequence placement [Wedell et al., 2022]. However, SCAMPP was designed to incrementally update a large tree, one query sequence at a time, and was not optimized for the other uses of phylogenetic placement, where batch placement of many sequencing reads is required. Here we introduce BATCH-SCAMPP, a technique that improves scalability in both dimensions: the number of query sequences being placed into the backbone tree and the size of the backbone tree. Furthermore, BATCH-SCAMPP is specifically designed to improve EPA-ng’s scalability to large backbone trees. Although BATCH-SCAMPP is based on SCAMPP, it uses a substantially modified design in order to be able to take advantage of EPA-ng’s ability to place many query sequences efficiently. The BATCH-SCAMPP method operates by allowing the input set of query sequences to suggest and then vote on placement subtrees, thus enabling many query sequences to select the same placement subtree. We pair BATCH-SCAMPP with EPA-ng to explore the capability of this approach for scaling to many query sequences. We show that this combination of techniques (which we call BSCAMPP+EPA-ng, or BSCAMPP(e)) not only provides high accuracy and scalability to large backbone trees, matching that of SCAMPP used with EPA-ng (i.e., SCAMPP(e)), but also achieves the goal of scaling sublinearly in the number of query sequences. Moreover, it is much more scalable than EPA-ng and faster than SCAMPP+EPA-ng: when placing 10,000 sequences into a backbone tree of 50,000 leaves, EPA-ng is unable to run due to memory issues, SCAMPP+EPA-ng requires 1421 minutes, and BSCAMPP(e) places all sequences in 7 minutes (all given the same computational resources. Figure 1 gives an example of this performance advantage on the nt78 [Chu and Warnow, 2023] simulated dataset.

Cite as

Eleanor Wedell, Chengze Shen, and Tandy Warnow. BATCH-SCAMPP: Scaling Phylogenetic Placement Methods to Place Many Sequences (Abstract). In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 3:1-3:2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


Copy BibTex To Clipboard

@InProceedings{wedell_et_al:LIPIcs.WABI.2023.3,
  author =	{Wedell, Eleanor and Shen, Chengze and Warnow, Tandy},
  title =	{{BATCH-SCAMPP: Scaling Phylogenetic Placement Methods to Place Many Sequences}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{3:1--3:2},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.3},
  URN =		{urn:nbn:de:0030-drops-186296},
  doi =		{10.4230/LIPIcs.WABI.2023.3},
  annote =	{Keywords: Phylogenetic Placement, EPA-ng, Phylogenetics}
}
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail